Lecture 10: Point Level Models - Model Fitting, cont.

Class Intro

Intro Questions

  • Discuss how the spatial range could be extended to \(\mathcal{R}^2\) rather than \(\mathcal{R}^1\) in the Gaussian Process setting. What changes in this situation?
  • For Today:
    • More Model Fitting
    • Hierarchical Models Intro

Universal Kriging

  • When covariate information is available for inclusion in the analysis, the kriging model is often referred to as universal kriging.
  • Now we have \[\boldsymbol{Y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}, \; \; \text{ where } \boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \Sigma)\]
  • The conditional distributions are very similar to those derived previously; watch for a HW question.

  • In each case, kriging or universal kriging, it is still necessary to estimate the following parameters: \(\sigma^2,\) \(\tau^2\), \(\phi\), and \(\mu\) or \(\beta\).
  • This can be done with least-squares methods or in a Bayesian framework.
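As a numerical sketch of the least-squares route (in Python/numpy here, though the course examples use R), the following computes the GLS estimate \(\hat{\boldsymbol{\beta}} = (X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}\boldsymbol{y}\) and the plug-in universal kriging predictor at a new site. The exponential covariance and all parameter values are illustrative assumptions, treated as known.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative (assumed known) covariance parameters
sigma2, tau2, phi = 1.0, 0.1, 0.5

def cov_matrix(d):
    # sigma^2 * exp(-d/phi), plus nugget tau^2 on the diagonal
    # (d == 0 only on the diagonal, assuming distinct sites)
    return sigma2 * np.exp(-d / phi) + tau2 * (d == 0)

# simulate n sites in the unit square with one covariate
n = 50
s = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, 1.5])

d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
Sigma = cov_matrix(d)
y = X @ beta + rng.multivariate_normal(np.zeros(n), Sigma)

# GLS estimate: (X' Sigma^-1 X)^-1 X' Sigma^-1 y
Sinv = np.linalg.inv(Sigma)
beta_hat = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)

# plug-in universal kriging predictor at a new site s0:
# x0' beta_hat + c0' Sigma^-1 (y - X beta_hat)
s0 = np.array([0.5, 0.5])
x0 = np.array([1.0, 0.3])  # hypothetical covariate value at s0
c0 = sigma2 * np.exp(-np.linalg.norm(s - s0, axis=1) / phi)
y0_hat = x0 @ beta_hat + c0 @ Sinv @ (y - X @ beta_hat)
```

In practice the covariance parameters are unknown and must be estimated first, which is the subject of the next slides.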

Parameter Estimation

About those other parameters

We still need

  • to choose an appropriate covariance function (or semivariogram)
  • and estimate parameters in that function

Functions in R for 2D kriging

  • The krige function in the gstat package performs kriging; however, this again requires a known variogram.

Functions in R for 2D kriging

  • The krige function in the gstat package performs kriging; however, this again requires a known variogram.
## [using ordinary kriging]

2D Simulation

  • Using the simulated spatial process, plot an empirical variogram.
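A sketch of this simulation (Python/numpy for illustration; all parameter values are hypothetical): draw a mean-zero process with exponential covariance at random 2D sites, then bin the squared half-differences by distance to form the empirical semivariogram.

```python
import numpy as np

rng = np.random.default_rng(2024)

# simulate a mean-zero GP at n random 2D sites, exponential covariance
n, sigma2, phi, tau2 = 200, 1.0, 0.2, 0.1
s = rng.uniform(size=(n, 2))
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
Sigma = sigma2 * np.exp(-d / phi) + tau2 * np.eye(n)
y = rng.multivariate_normal(np.zeros(n), Sigma)

# empirical (binned) semivariogram: gamma_hat(h) = mean of (y_i - y_j)^2 / 2
iu = np.triu_indices(n, k=1)
dists = d[iu]
sqdiff = 0.5 * (y[iu[0]] - y[iu[1]]) ** 2
bins = np.linspace(0, 1.0, 11)
which = np.digitize(dists, bins)
gamma_hat = np.array([sqdiff[which == b].mean() for b in range(1, 11)])
```

Plotting `gamma_hat` against the bin midpoints gives the empirical variogram: it should rise from near the nugget \(\tau^2\) at short lags toward the sill \(\tau^2 + \sigma^2\).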

Variogram Fitting: Optimization to Empirical Variogram

## variofit: covariance model used is exponential 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim
## Warning in variofit(emp.vario, cov.model = "exponential"): initial values
## not provided - running the default search
## variofit: searching for best initial value ... selected values:
##               sigmasq phi   tausq  kappa
## initial.value "0.84"  "0"   "0.17" "0.5"
## status        "est"   "est" "est"  "fix"
## loss value: 6873.17015373233
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: exponential
## parameter estimates:
##   tausq sigmasq     phi 
##  0.1797  0.8546  0.0000 
## Practical Range with cor=0.05 for asymptotic range: 0.0001159668
## 
## variofit: minimised weighted sum of squares = 6633.819

Variogram Fitting: Optimization to Empirical Variogram

## variofit: covariance model used is exponential 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim
## Warning in variofit(emp.vario2, cov.model = "exponential"): initial values
## not provided - running the default search
## variofit: searching for best initial value ... selected values:
##               sigmasq phi    tausq kappa
## initial.value "1.14"  "1.09" "0"   "0.5"
## status        "est"   "est"  "est" "fix"
## loss value: 2384.34387917209
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: exponential
## parameter estimates:
##   tausq sigmasq     phi 
##  0.0189  1.0468  0.7845 
## Practical Range with cor=0.05 for asymptotic range: 2.350179
## 
## variofit: minimised weighted sum of squares = 1035.009
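The WLS criterion minimized above can be written out directly: weight the squared deviations between the empirical and theoretical semivariogram by the number of pairs in each bin (the "weights used: npairs" line in the output). Below is a sketch in Python/scipy rather than geoR's variofit; the binned estimates and pair counts are hypothetical values for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical empirical semivariogram: bin centers, estimates, pair counts
h = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
gamma_emp = np.array([0.35, 0.55, 0.70, 0.82, 0.90, 0.96, 1.00, 1.02])
npairs = np.array([120, 230, 310, 350, 360, 340, 300, 250])

def exp_semivariogram(h, tau2, sigma2, phi):
    # gamma(h) = tau^2 + sigma^2 * (1 - exp(-h/phi))
    return tau2 + sigma2 * (1.0 - np.exp(-h / phi))

def wls_loss(theta):
    # weighted sum of squares, weights = number of pairs per bin
    tau2, sigma2, phi = theta
    return np.sum(npairs * (gamma_emp - exp_semivariogram(h, tau2, sigma2, phi)) ** 2)

fit = minimize(wls_loss, x0=[0.1, 1.0, 0.3], method="Nelder-Mead")
tau2_hat, sigma2_hat, phi_hat = fit.x
```

As in the output above, the result is a (tausq, sigmasq, phi) triple minimizing the weighted sum of squares; different starting values can land in different local minima, which is why variofit searches over initial values.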

Variogram Fitting: Exercise

  • Write code to simulate spatial processes with an exponential covariance function and evaluate the fitted values for the parameters.
  • Vary the number of samples (25, 100, 500) and replicate each setting 10 times. Discuss how close the fitted values are to the true parameter values.

Variogram Fitting: Solution \(\sigma^2\)

## Warning: Removed 35 rows containing non-finite values (stat_ydensity).

Variogram Fitting: Solution \(\tau^2\)

Variogram Fitting: Solution \(\phi\)

## Warning: Removed 43 rows containing non-finite values (stat_ydensity).

Hierarchical Modeling for Point Referenced Data

Spatial Process Theory

  • Specifying a spatial process concerns a collection of random variables \(\{Y(\boldsymbol{s}) : \boldsymbol{s} \in \mathcal{R}^2\}\)
  • Recall, \(\boldsymbol{s} \in \mathcal{R}^2\) requires an infinite dimensional distribution.
  • This is typically achieved using a Gaussian process, which enables use of a multivariate normal distribution and only requires specifying the mean and covariance function.

Gaussian Process

  • In practice, the process is only observed at a finite number of locations, \(\{\boldsymbol{s_1}, \dots, \boldsymbol{s_n}\}\)
  • Using \(\{Y(\boldsymbol{s_1}), \dots, Y(\boldsymbol{s_n})\}\) the goal is to learn the mean and covariance structure of the latent spatial process using observed data
  • and make predictions of \(Y(\boldsymbol{s_0})\)
  • A common statistical modeling approach uses Bayesian hierarchical models

Connections and Asymptotics

  • This is quite similar to dynamic linear models where there is a latent state-space specification.
  • Asymptotics focuses on the convergence of estimators with large samples. In spatial statistics, asymptotics are focused on spatial infill. That is, considering the properties of the estimators as the distance between the points goes to zero.

Bayesian Statistics Overview: Sampling Model

  • Bayesian statistics treats the unknown parameters as random variables, in addition to the observed data.
  • As with classical statistical techniques, fitting a model requires specifying the statistical distribution for the observed data \(\boldsymbol{y} = \{y_1, \dots, y_n\}\) given parameters \(\boldsymbol{\theta}\), denoted as \(p(\boldsymbol{y}|\boldsymbol{\theta})\).
  • \(p(\boldsymbol{y}|\boldsymbol{\theta})\) is often referred to as a sampling model.
  • For instance, with a regression model, \[\boldsymbol{y}|\boldsymbol{\beta}, \boldsymbol{x} \sim ??\]

Bayesian Statistics Overview: Sampling Model

  • With a regression model, typically the residuals are assumed to follow a normal distribution. \[Y_i = \boldsymbol{\beta}^T \boldsymbol{x}_i + \epsilon_i, \; \; \epsilon_i \sim N(0, \sigma^2)\] Hence, \[\boldsymbol{y}|\boldsymbol{\beta}, \boldsymbol{x} \sim N(X\boldsymbol{\beta}, \sigma^2 I)\]
  • In regression, the goal is to make inferences about the parameters \(\boldsymbol{\theta} = \{\boldsymbol{\beta}, \sigma^2\}\)

Bayesian Statistics Overview: Prior Distribution

  • Inference in Bayesian statistics is distributional, rather than a single point.
  • Distributional beliefs are iteratively updated in the presence of new data

  • A prior distribution, \(p(\boldsymbol{\theta}|\boldsymbol{\lambda})\), is the belief about the parameters \(\boldsymbol{\theta}\) before collecting data.

  • A set of hyperparameters, \(\boldsymbol{\lambda}\), can be used to define the prior distribution.

Bayesian Statistics Overview: Prior Distribution

  • Assume we are interested in learning the relationship between the cost of a pizza in Bozeman and the number of slices in that pizza.
  • Write out and specify a sampling model for this regression problem.

  • Sketch and/or define a prior distribution for your parameters in the sampling model.

Bayesian Statistics Overview: Prior Distribution for Pizza

Bayesian Statistics Overview: Posterior Distribution

  • Updating the prior distribution after observing data results in the posterior distribution.
  • Inference about the parameters is based on the posterior distribution.
  • Ignoring \(\boldsymbol{\lambda}\) for now, the posterior distribution is defined as \[p(\boldsymbol{\theta}|\boldsymbol{y}) = \frac{p(\boldsymbol{y},\boldsymbol{\theta})}{p(\boldsymbol{y})} = \frac{p(\boldsymbol{y},\boldsymbol{\theta})}{\int p(\boldsymbol{y},\boldsymbol{\theta}) d \theta} = \frac{\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{y})p(\boldsymbol{\theta})}{\int \mathcal{L}(\boldsymbol{\theta}|\boldsymbol{y})p(\boldsymbol{\theta}) d \theta}\]

Bayesian Statistics Overview: Posterior Distribution for Pizza

  • We observe a data point where a pizza with 8 slices sells for $12.
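Under a conjugate normal setup this single observation can be updated in closed form. The sketch below (Python; the prior values and the known error variance are hypothetical choices, not from the slides) treats price-per-slice \(\beta\) as the only unknown in \(y = \beta x + \epsilon\), \(\epsilon \sim N(0, \sigma^2)\).

```python
import numpy as np

# one observation: a pizza with 8 slices selling for $12
x, y = 8.0, 12.0

# hypothetical prior: price-per-slice beta ~ N(1.25, 0.5^2); sigma^2 assumed known
m0, v0 = 1.25, 0.5 ** 2
sigma2 = 1.0 ** 2

# conjugate normal update for beta in y = beta * x + eps:
# posterior variance v1 = (1/v0 + x^2/sigma^2)^-1
# posterior mean     m1 = v1 * (m0/v0 + x*y/sigma^2)
v1 = 1.0 / (1.0 / v0 + x ** 2 / sigma2)
m1 = v1 * (m0 / v0 + x * y / sigma2)
# m1 lands between the prior mean 1.25 and the data value 12/8 = 1.5,
# and v1 < v0: the data shrink the uncertainty about beta
```

Because \(x^2/\sigma^2 = 64\) dwarfs the prior precision \(1/v_0 = 4\), the posterior mean sits much closer to the data's $1.50 per slice than to the prior's $1.25.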

Bayesian Statistics Overview: Hierarchical Posterior Distribution

  • Recall that \(\boldsymbol{\lambda}\) are hyperparameters in the prior distribution \(p(\boldsymbol{\theta}|\boldsymbol{\lambda})\).
  • For instance, we might say that \(\theta|\lambda \sim N(\lambda,1)\)
  • Then we also need prior distributions (hyperpriors) for \(\lambda\), \(p(\lambda)\)
  • Then the posterior is \[p(\boldsymbol{\theta}, \boldsymbol{\lambda}|\boldsymbol{y}) = \frac{p(\boldsymbol{y},\boldsymbol{\theta},\boldsymbol{\lambda})}{p(\boldsymbol{y})} = \frac{\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{y})p(\boldsymbol{\theta}|\boldsymbol{\lambda})p(\boldsymbol{\lambda})}{\int \int \mathcal{L}(\boldsymbol{\theta}|\boldsymbol{y})p(\boldsymbol{\theta}|\boldsymbol{\lambda})p(\boldsymbol{\lambda})\,d \boldsymbol{\theta} \,d \boldsymbol{\lambda}}\]

Hierarchical Model

There are three levels of this (hierarchical) model

  1. \(p(\boldsymbol{y}|\boldsymbol{\theta})\) [data | process]
  2. \(p(\boldsymbol{\theta}|\boldsymbol{\lambda})\) [process | parameters]
  3. \(p(\boldsymbol{\lambda})\) [parameters]

More about Bayes

Banerjee, Gelfand, and Carlin state, "Bayesian inferential paradigm offers potentially attractive advantages over the classical, frequentist statistical approach through

  • its more philosophically sound foundation,
  • its unified approach to data analysis,
  • and its ability to incorporate prior opinion via the prior distribution."

Stationary Spatial Process

  • The model for a Gaussian process can be written as \[Y(\boldsymbol{s}) = \mu(\boldsymbol{s}) + w(\boldsymbol{s}) + \epsilon(\boldsymbol{s}),\] where \(\mu(\boldsymbol{s}) = x(\boldsymbol{s})^T\boldsymbol{\beta}\) is the mean structure. Then the residual can be partitioned into two pieces: a spatial component \(w(\boldsymbol{s})\) and a non-spatial component \(\epsilon(\boldsymbol{s}).\)

  • We assume \(w(\boldsymbol{s})\) are realizations from a Gaussian Process (GP) with mean zero that models residual spatial structure.

  • Then \(\epsilon(\boldsymbol{s})\) are uncorrelated error terms.

  • Q: how do \(w(\boldsymbol{s})\) + \(\epsilon(\boldsymbol{s})\) relate to the partial sill, range, and nugget?

\(w(\boldsymbol{s})\) and \(\epsilon(\boldsymbol{s})\)

  • The partial sill, \(\sigma^2\), and the range, \(\phi\), are modeled with \(w(\boldsymbol{s})\)
  • The nugget is contained in the \(\epsilon(\boldsymbol{s})\) term.
  • This model assumes stationarity: the correlation is a function only of the separation vector between points.
  • Furthermore, if the correlation is a function only of the distance between points, the process is also isotropic.
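These pieces can be read off the implied semivariogram. Assuming an exponential correlation for \(w(\boldsymbol{s})\), \(\gamma(h) = \tau^2 + \sigma^2(1 - e^{-h/\phi})\), the nugget is the limit at the origin, the sill is the limit at large distance, and the practical range is roughly \(3\phi\) (the distance at which correlation drops to 0.05, matching the "Practical Range" lines in the variofit output). A quick numerical check (Python, illustrative values):

```python
import numpy as np

# semivariogram implied by Y(s) = mu(s) + w(s) + eps(s) with
# exponential covariance for w: gamma(h) = tau^2 + sigma^2 * (1 - exp(-h/phi))
sigma2, tau2, phi = 1.0, 0.2, 0.5  # illustrative values

def gamma(h):
    return tau2 + sigma2 * (1.0 - np.exp(-h / phi))

near_origin = gamma(1e-8)   # -> tau^2 (the nugget, from eps)
sill = gamma(1e6)           # -> tau^2 + sigma^2 (nugget + partial sill)
practical_range = -phi * np.log(0.05)  # correlation 0.05, roughly 3 * phi
```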

Model Specification

  • Let \(\Sigma = \sigma^2 H(\phi) + \tau^2 I\)
  • Define \(\boldsymbol{\theta} = (\boldsymbol{\beta}, \sigma^2, \tau^2, \phi)\)
  • Then the sampling model can be written as:
  • \(\boldsymbol{Y}| \boldsymbol{\theta} \sim N(X \boldsymbol{\beta}, \sigma^2 H(\phi) + \tau^2 I)\)
  • Given a (set of) prior distribution(s), \(p(\theta)\), the posterior distribution of the parameters can be computed (more later) as \(p(\theta|\boldsymbol{y})\).

Model Specification as Hierarchical Model

  • The model can be rewritten as \[\boldsymbol{Y}| \boldsymbol{\theta}, \boldsymbol{W} \sim N(X \boldsymbol{\beta} + \boldsymbol{W}, \tau^2 I), \; \;\;\;\;\; \text{ [data|process, parameters]}\] where \(\boldsymbol{W} = (w(\boldsymbol{s_1}), \dots , w(\boldsymbol{s_n}))^T\) is a vector of spatial random effects.
  • The second-stage, or the process, is \[\boldsymbol{W}|\sigma^2, \phi \sim N(\boldsymbol{0},\sigma^2 H(\phi))\; \;\;\;\;\; \text{ [process | parameters]}\]
  • The third level is the prior specification: \(p(\theta)\) [parameters]
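Marginalizing \(\boldsymbol{W}\) out of the hierarchical form recovers the earlier sampling model, since \(\text{Var}(\boldsymbol{Y}) = \sigma^2 H(\phi) + \tau^2 I\). A small sketch (Python/numpy, illustrative values) drawing from the hierarchy:

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative settings: n sites, exponential correlation H(phi)
n, sigma2, tau2, phi = 40, 1.0, 0.25, 0.3
s = rng.uniform(size=(n, 2))
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
H = np.exp(-d / phi)
X = np.column_stack([np.ones(n), s[:, 0]])
beta = np.array([1.0, 2.0])

# hierarchical draw: W | sigma2, phi first, then Y | W, tau2
W = rng.multivariate_normal(np.zeros(n), sigma2 * H)
Y = rng.multivariate_normal(X @ beta + W, tau2 * np.eye(n))

# marginal covariance after integrating W out: sigma^2 H(phi) + tau^2 I
Sigma_marginal = sigma2 * H + tau2 * np.eye(n)
```

Sampling `Y` this way is distributionally equivalent to drawing directly from \(N(X\boldsymbol{\beta}, \Sigma)\) with \(\Sigma =\) `Sigma_marginal`; the hierarchical form simply makes the latent spatial process explicit.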

What about those prior distributions??

Next, prior distributions are necessary for

  • \(\boldsymbol{\beta}\)
  • \(\sigma^2\)
  • \(\phi\)
  • \(\tau^2\)